Abstract

In continuation of recent work on the statistical-mechanical analysis of minimum mean square error (MMSE) estimation in Gaussian noise via its relation to the mutual information (the I-MMSE relation), we propose here a simple and more direct relationship between optimum estimation and certain information measures (e.g., the information density and the Fisher information), which can be viewed as partition functions and hence are amenable to analysis using statistical-mechanical techniques. The proposed approach has several advantages, most notably its applicability to general sources and channels, as opposed to the I-MMSE relation and its variants, which hold only for certain classes of channels (e.g., additive white Gaussian noise channels). We then demonstrate the derivation of the conditional mean estimator and the MMSE in a few examples. Two of these examples turn out to be generalizable to a fairly wide class of sources and channels. For this class, the proposed approach is shown to yield an approximate conditional mean estimator and an MMSE formula that has the flavor of a single-letter expression. We also show how our approach can easily be generalized to situations of mismatched estimation.

Index Terms: Conditional mean estimation, minimum mean squared error, partition function, 
statistical mechanics, Fisher information. 



1 Introduction 

Relationships between signal estimation, signal detection, and information measures, both in discrete time and continuous time, have been known for decades [1], [3], [8] and have attracted a remarkable degree of renewed interest and research activity in the last several years; see, e.g., [4], [5], [6], [7], [11], [12], [13] and references therein.

In particular, in [5], Guo, Shamai, and Verdú derived a relation between the mutual information between the input and the output of an additive white Gaussian noise (AWGN) channel and the minimum mean squared error (MMSE) of non-causal estimation of the channel input based on its output. This relation, which is often called the I-MMSE relation, shows that the derivative of the mutual information with respect to (w.r.t.) the signal-to-noise ratio (SNR) is equal to half of the MMSE, and it is intimately related to the de Bruijn identity [2, Sec. 17.7]. Later, this relation was generalized and further developed in several directions: Guo, Shamai, and Verdú [6] and Raginsky and Coleman [12] derived relations of the same spirit for more general additive channels. Palomar and Verdú [11] studied relations between the covariance matrix of the MMSE estimator and arbitrary gradients of the mutual information for a general vector Gaussian channel, which also allows a linear transformation of the input signal. In [7], relations between information measures and estimation measures were derived for Poisson channels. More recently, Verdú [13] extended the I-MMSE relation for Gaussian noise to the paradigm of mismatched conditional mean estimation, that is, to an estimator that is optimally matched to a wrong probability distribution assumed on the input signal. The excess mean squared error (MSE) due to this mismatch was shown to be related to the Kullback-Leibler divergence between the channel output distributions corresponding to the true and the assumed input distributions (see also [4] for a further study in this direction). In [9], the I-MMSE relation was further investigated from a statistical physics perspective, where, among other results, it was demonstrated how statistical-mechanical tools can be harnessed to assess the MMSE via the I-MMSE relation of [5], using the fact that in many cases, the mutual information can be viewed as the partition function of a certain physical system.

This paper is a further development in the direction of [9] described above. The main idea is that, for the purpose of evaluating the error covariance matrix of the MMSE estimator, one may use a conceptually simple and more direct relationship between the MMSE error covariance matrix and other information measures, which can also be presented in the form of a certain partition function and hence be analyzed using methods of statistical physics. The main advantage of the proposed approach over the I-MMSE relation and its variants is its full generality: it applies, in principle, to any joint probability function $P(\mathbf{x},\mathbf{y})$ of the channel input signal $\mathbf{x} = (x_1, \ldots, x_n)$, to be estimated, and the channel output $\mathbf{y} = (y_1, \ldots, y_m)$ (where $m$ and $n$ are positive integers), provided that certain technical regularity conditions hold. The channel $P(\mathbf{y}|\mathbf{x})$ does not even have to be additive, as opposed to the assumptions made in [6] and [12]. Moreover, the dimension $m$ of the channel output vector $\mathbf{y}$ does not have to be the same as the dimension $n$ of the input vector $\mathbf{x}$.

In a nutshell, the idea is to define, for a given $n$-vector of real-valued parameters $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_n)$, the 'partition function'

$$Z(\mathbf{y}, \boldsymbol{\lambda}) = \sum_{\mathbf{x}} P(\mathbf{x}, \mathbf{y}) \exp\{\boldsymbol{\lambda}^T \mathbf{x}\}, \qquad (1)$$

where we have implicitly assumed that $\mathbf{x}$ takes on discrete values; otherwise, the sum should simply be replaced by an integral. Now, it is straightforward to show that the gradient of $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ w.r.t. $\boldsymbol{\lambda}$, computed at $\boldsymbol{\lambda} = 0$, gives the conditional mean estimator $\hat{\mathbf{x}} = E\{\mathbf{X}|\mathbf{y}\}$, whereas the expectation of the Hessian of the same function, again at $\boldsymbol{\lambda} = 0$, gives the error covariance matrix of the MMSE estimator. As we shall see in the sequel, $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ lends itself to closed-form analytic evaluation (in the spirit of a single-letter formula) in a fairly wide spectrum of situations, using methods of statistical mechanics. Thus, the MMSE estimator and its performance can quite easily be derived in these situations as well. Moreover, as was demonstrated extensively in [9], the statistical-mechanical perspective on estimation-theoretic problems may offer not only analysis techniques, but also some important insights with regard to threshold effects (whenever they exist) via the inspection of possible phase transitions in the parallel statistical-mechanical model.

Besides the general applicability of this approach, it has several additional advantages: 

1. As mentioned in the previous paragraph, it provides, not only the MMSE error covariance 
matrix, but also the conditional mean estimator itself. 

2. As will be seen, several variants of these relations between estimation measures and information measures can be offered. In some cases, one of the relations may be more convenient to work with than the others.

3. The approach is easy to extend to the mismatched case. Furthermore, it allows mismatch in both the source and the channel (as opposed to [13], which allows mismatch in the source only).

The remaining part of this paper is organized as follows. In Section 2, we establish notation conventions. In Section 3, we first derive the basic relations between the conditional mean estimator, as well as its error covariance matrix, and the above-mentioned partition function. In the same section, we also discuss this relation and derive a few variants that also involve information measures, like the information density, the Fisher information, etc. We also outline the extension to mismatched estimation. In Section 4, we provide three examples. In Section 5, we show how two of them set the stage for the analysis of a more general class of joint distributions $P(\mathbf{x}, \mathbf{y})$. Finally, in Section 6, we summarize and conclude the paper.

2 Notation Conventions 

Throughout this paper, scalar random variables (RVs) will be denoted by capital letters, their sample values will be denoted by the respective lower case letters, and their alphabets will be denoted by the respective calligraphic letters. A similar convention will apply to random vectors and their sample values, which will be denoted by the same symbols in boldface font. Thus, for example, $\mathbf{X}$ will denote a random vector $(X_1, \ldots, X_n)$, and $\mathbf{x} = (x_1, \ldots, x_n)$ is a specific vector value in $\mathcal{X}^n$, the $n$-th Cartesian power of $\mathcal{X}$. The notations $y_i^j$ and $Y_i^j$, where $i$ and $j$ are integers and $i < j$, will designate the segments $(y_i, \ldots, y_j)$ and $(Y_i, \ldots, Y_j)$, respectively.

Probability functions will be denoted generically by the letter $P$ or $Q$. In particular, $P(\mathbf{x},\mathbf{y})$ is the joint probability mass function (in the discrete case) or the joint density (in the continuous case) of the desired channel input vector $\mathbf{x} = (x_1, \ldots, x_n)$ and the observed channel output vector $\mathbf{y} = (y_1, \ldots, y_m)$. Accordingly, $P(\mathbf{x})$ will denote the marginal of $\mathbf{x}$, $P(\mathbf{y}|\mathbf{x})$ will denote the conditional probability mass (or density) of $\mathbf{y}$ given $\mathbf{x}$, induced by the channel, and so on. Whenever there is
room for ambiguity, these probability functions will be subscripted by the names of the random 
variables and the conditionings, according to standard notation conventions in probability theory 
and information theory. Throughout the sequel, we will assume discrete valued alphabets, mostly 
for the sake of simplicity and convenience. Extensions to continuous valued situations will be 
straightforward with summations being replaced by integrations, etc. Indeed, some of our examples 
will involve continuous valued random variables. 

The expectation operator of a generic function $f(\mathbf{x}, \mathbf{y})$ w.r.t. the joint distribution $P$ of $(\mathbf{X}, \mathbf{Y})$ will be denoted by $E\{f(\mathbf{X},\mathbf{Y})\}$. The conditional expectation of the same function given that $\mathbf{Y} = \mathbf{y}$, denoted $E\{f(\mathbf{X}, \mathbf{Y})|\mathbf{Y} = \mathbf{y}\}$, which is obviously identical to $E\{f(\mathbf{X}, \mathbf{y})|\mathbf{Y} = \mathbf{y}\}$, is, of course, a function of $\mathbf{y}$. On substituting $\mathbf{Y}$ in this function, it becomes a random variable, which will be denoted by $E\{f(\mathbf{X},\mathbf{Y})|\mathbf{Y}\}$. When using vectors and matrices in a linear-algebraic format, $n$-dimensional vectors, like $\mathbf{x}$ (and $\mathbf{X}$), will be understood as column vectors, the operator $(\cdot)^T$ will denote vector or matrix transposition, and so $\mathbf{x}^T$ would be a row vector. For two positive sequences $\{a_n\}$ and $\{b_n\}$, the notation $a_n \doteq b_n$ means equivalence in the exponential order, i.e., $\lim_{n\to\infty} \frac{1}{n}\log(a_n/b_n) = 0$. Finally, the indicator function of an event $\mathcal{A}$ will be denoted by $1\{\mathcal{A}\}$, i.e., $1\{\mathcal{A}\} = 1$ if $\mathcal{A}$ occurs and $1\{\mathcal{A}\} = 0$ if not.

3 MMSE Estimation Relations 

This section consists of two subsections. In the first, we derive the main basic relations and in the 
second, we show how to extend the scope to the case of mismatched estimation. 

3.1 Basic Relations 

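Let $\mathbf{X} = (X_1, \ldots, X_n)$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)$ ($n$ and $m$ being positive integers) be two random vectors, jointly distributed according to a given probability function $P(\mathbf{x}, \mathbf{y})$. It is further assumed that the alphabet $\mathcal{X}$ of each component of $\mathbf{X}$ consists of real-valued numbers, i.e., $\mathcal{X} \subseteq \mathbb{R}$. This assumption is obviously necessary in order to make the problem of estimating $\mathbf{X}$, in the MSE sense, meaningful. The conditional mean estimator of $\mathbf{X}$ based on $\mathbf{Y}$, i.e., $\hat{\mathbf{X}} = E\{\mathbf{X}|\mathbf{Y}\}$, is well known to be the optimum estimator in the MSE sense, i.e., it minimizes the MSE $E\{(X_i - \hat{X}_i)^2\}$ for all $i = 1, 2, \ldots, n$. The MMSE in estimating $X_i$ is then $E\{(X_i - E\{X_i|\mathbf{Y}\})^2\}$, i.e., the expected conditional variance of $X_i$ given $\mathbf{Y}$. More generally, the MMSE error covariance matrix $\mathbf{E}$ is an $n \times n$ matrix whose $(i, j)$-th element is given by $E\{(X_i - E\{X_i|\mathbf{Y}\})(X_j - E\{X_j|\mathbf{Y}\})\}$. This matrix can be represented as the expectation (w.r.t. $\mathbf{Y}$) of the conditional covariance matrix of $\mathbf{X}$ given $\mathbf{Y}$, henceforth denoted $\mathrm{Cov}\{\mathbf{X}|\mathbf{Y}\}$, i.e.,

$$\mathbf{E} = E\{\mathrm{Cov}\{\mathbf{X}|\mathbf{Y}\}\} = E\{\mathbf{X}\mathbf{X}^T\} - E\{E\{\mathbf{X}|\mathbf{Y}\} \cdot E\{\mathbf{X}^T|\mathbf{Y}\}\}.$$

Defining a column vector of $n$ real-valued parameters, $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_n)^T$, consider the following function:

$$Z(\mathbf{y},\boldsymbol{\lambda}) \triangleq \sum_{\mathbf{x}\in\mathcal{X}^n} \exp\{\boldsymbol{\lambda}^T \mathbf{x}\} P(\mathbf{x},\mathbf{y}) = \sum_{\mathbf{x}\in\mathcal{X}^n} \exp\{\boldsymbol{\lambda}^T \mathbf{x}\} P(\mathbf{x}) P(\mathbf{y}|\mathbf{x}),$$

where it is assumed that the sum (or integral, in the continuous case) converges uniformly at least in some neighborhood of $\boldsymbol{\lambda} = 0$.$^1$ It is straightforward to see now that:

$$\left.\frac{\partial \ln Z(\mathbf{y},\boldsymbol{\lambda})}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0} = \frac{\sum_{\mathbf{x}} x_i P(\mathbf{x},\mathbf{y})}{\sum_{\mathbf{x}} P(\mathbf{x},\mathbf{y})} = E\{X_i|\mathbf{y}\},$$

i.e.,

$$E\{\mathbf{X}|\mathbf{y}\} = \nabla_{\boldsymbol{\lambda}} \ln Z(\mathbf{y},\boldsymbol{\lambda})\big|_{\boldsymbol{\lambda}=0}, \qquad (2)$$

where $\nabla_{\boldsymbol{\lambda}}$ denotes the gradient w.r.t. $\boldsymbol{\lambda}$. Similarly, upon taking second order derivatives, one obtains

$$\left.\frac{\partial^2 \ln Z(\mathbf{y},\boldsymbol{\lambda})}{\partial \lambda_i\, \partial \lambda_j}\right|_{\boldsymbol{\lambda}=0} = E\{X_i X_j|\mathbf{y}\} - E\{X_i|\mathbf{y}\} \cdot E\{X_j|\mathbf{y}\} = \mathrm{Cov}\{X_i, X_j|\mathbf{y}\},$$

and so,

$$\mathbf{E} = E\left\{\nabla^2_{\boldsymbol{\lambda}} \ln Z(\mathbf{Y},\boldsymbol{\lambda})\big|_{\boldsymbol{\lambda}=0}\right\}, \qquad (3)$$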
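where $\nabla^2_{\boldsymbol{\lambda}}$ is the Hessian w.r.t. $\boldsymbol{\lambda}$, namely, the matrix of second order derivatives w.r.t. pairs of components of $\boldsymbol{\lambda}$. Note that here and throughout the sequel, we will always refer to gradients and Hessians of functions w.r.t. $\boldsymbol{\lambda}$, computed at the point $\boldsymbol{\lambda} = 0$. It will therefore be convenient to use, for a generic function $g$, the shorthand notations $\nabla_0 g(\boldsymbol{\lambda})$ and $\nabla_0^2 g(\boldsymbol{\lambda})$ to designate $\nabla_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda})|_{\boldsymbol{\lambda}=0}$ and $\nabla^2_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda})|_{\boldsymbol{\lambda}=0}$, respectively.

Another, perhaps simpler, way to look at the relations (2) and (3) is the following: Obviously, for a given $\mathbf{y}$, $M(\mathbf{y},\boldsymbol{\lambda}) = \sum_{\mathbf{x}} e^{\boldsymbol{\lambda}^T \mathbf{x}} P(\mathbf{x}|\mathbf{y})$ is the moment generating function pertaining to the conditional distribution of $\mathbf{X}$ given $\mathbf{y}$, and so its derivatives w.r.t. $\{\lambda_i\}$, computed at $\boldsymbol{\lambda} = 0$, yield the conditional moments $E\{X_i|\mathbf{y}\}$, $E\{X_i^2|\mathbf{y}\}$, $E\{X_i X_j|\mathbf{y}\}$, etc. Therefore, $\ln M(\mathbf{y},\boldsymbol{\lambda})$ is a generator of the corresponding conditional cumulants, $E\{X_i|\mathbf{y}\}$, $\mathrm{Var}\{X_i|\mathbf{y}\}$, $\mathrm{Cov}\{X_i, X_j|\mathbf{y}\}$, etc. Now, observe that $\ln M(\mathbf{y}, \boldsymbol{\lambda})$ differs from $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ merely by the additive term $\ln P(\mathbf{y})$, which does not depend on $\boldsymbol{\lambda}$ and hence does not affect the gradient and Hessian w.r.t. $\boldsymbol{\lambda}$. Therefore, $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ is a generator of conditional cumulants, exactly like $\ln M(\mathbf{y},\boldsymbol{\lambda})$. An important point, however, is that we prefer $\ln Z(\mathbf{y},\boldsymbol{\lambda})$ over $\ln M(\mathbf{y}, \boldsymbol{\lambda})$ because normally, it is more convenient to work with the joint distribution $P(\mathbf{x}, \mathbf{y})$ (or equivalently, with the source $P(\mathbf{x})$ and forward channel $P(\mathbf{y}|\mathbf{x})$) rather than with the backward channel (or the posterior) $P(\mathbf{x}|\mathbf{y})$.$^2$

$^1$ If this assumption is not met, one can instead parametrize each component $\lambda_i$ of $\boldsymbol{\lambda}$ as a purely imaginary number $\lambda_i = j\omega_i$ ($j = \sqrt{-1}$), as is done in the definition of the characteristic function.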
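As a sanity check of relations (2) and (3), the following minimal sketch compares finite-difference derivatives of $\ln Z(\mathbf{y},\boldsymbol{\lambda})$ at $\boldsymbol{\lambda} = 0$ with the conditional mean and conditional covariance computed directly from the posterior. The alphabet, the prior, and the logistic channel are hypothetical choices made purely for illustration (they do not appear in the paper), and `n = 2` is used only to keep the enumeration small.

```python
import itertools
import math

ALPHABET = (-1.0, 0.5, 2.0)                        # hypothetical real-valued alphabet
XS = list(itertools.product(ALPHABET, repeat=2))   # all input vectors for n = 2

def _prior_weight(x):
    # Hypothetical unnormalised prior with correlated components.
    return math.exp(0.3 * x[0] * x[1] - 0.1 * (x[0] ** 2 + x[1] ** 2))

PRIOR_NORM = sum(_prior_weight(x) for x in XS)

def joint(x, y):
    """Hypothetical joint P(x, y): prior times a logistic channel, y in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-(x[0] + x[1])))   # P(y = 1 | x)
    return _prior_weight(x) / PRIOR_NORM * (p1 if y == 1 else 1.0 - p1)

def log_z(y, lam):
    """ln Z(y, lambda) = ln sum_x exp(lambda^T x) P(x, y)."""
    return math.log(sum(math.exp(lam[0] * x[0] + lam[1] * x[1]) * joint(x, y)
                        for x in XS))

def grad0(y, i, h=1e-4):
    """Central-difference estimate of d ln Z / d lambda_i at lambda = 0."""
    lam_p, lam_m = [0.0, 0.0], [0.0, 0.0]
    lam_p[i], lam_m[i] = h, -h
    return (log_z(y, lam_p) - log_z(y, lam_m)) / (2 * h)

def hess0(y, i, j, h=1e-3):
    """Finite-difference estimate of d^2 ln Z / (d lambda_i d lambda_j) at 0."""
    total = 0.0
    for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
        lam = [0.0, 0.0]
        lam[i] += si * h
        lam[j] += sj * h
        total += si * sj * log_z(y, lam)
    return total / (4 * h * h)

def cond_mean(y, i):
    """E{X_i | y} computed directly from the posterior."""
    p_y = sum(joint(x, y) for x in XS)
    return sum(x[i] * joint(x, y) for x in XS) / p_y

def cond_cov(y, i, j):
    """Cov{X_i, X_j | y} computed directly from the posterior."""
    p_y = sum(joint(x, y) for x in XS)
    second = sum(x[i] * x[j] * joint(x, y) for x in XS) / p_y
    return second - cond_mean(y, i) * cond_mean(y, j)
```

Averaging `hess0(y, i, j)` over the output distribution would then give the entries of the error covariance matrix in (3); the finite differences are used here only because they make the check independent of any hand-derived formula.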

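We next derive several alternative versions of this relation between the error covariance matrix of the MMSE estimator and derivatives of $\ln Z$. First, observe that $Z(\mathbf{y},\boldsymbol{\lambda})$ factors as $P_{\boldsymbol{\lambda}}(\mathbf{y}) \cdot \Theta(\boldsymbol{\lambda})$, where

$$\Theta(\boldsymbol{\lambda}) = \sum_{\mathbf{x}\in\mathcal{X}^n} P(\mathbf{x})\exp\{\boldsymbol{\lambda}^T \mathbf{x}\}$$

and $P_{\boldsymbol{\lambda}}(\mathbf{y})$ is the output marginal of $\mathbf{y}$ induced by the channel $P(\mathbf{y}|\mathbf{x})$ and the modified source distribution $P_{\boldsymbol{\lambda}}(\mathbf{x}) = e^{\boldsymbol{\lambda}^T \mathbf{x}} P(\mathbf{x})/\Theta(\boldsymbol{\lambda})$. We therefore obtain

$$\begin{aligned}
\mathbf{E} &= E\{\nabla_0^2 \ln Z(\mathbf{Y},\boldsymbol{\lambda})\} \\
&= E\{\nabla_0^2 \ln[P_{\boldsymbol{\lambda}}(\mathbf{Y}) \cdot \Theta(\boldsymbol{\lambda})]\} \\
&= \nabla_0^2 \ln \Theta(\boldsymbol{\lambda}) + E\{\nabla_0^2 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})\} \\
&= \mathrm{Cov}\{\mathbf{X}\} - \mathbf{J},
\end{aligned} \qquad (4)$$

where $\mathrm{Cov}\{\mathbf{X}\} = E\{\mathbf{X}\mathbf{X}^T\} - E\{\mathbf{X}\} \cdot E\{\mathbf{X}^T\}$ is the covariance matrix of $\mathbf{X}$ and $\mathbf{J}$ is the Fisher information matrix of estimating $\boldsymbol{\lambda}$ based on $\mathbf{Y}$, computed at the point $\boldsymbol{\lambda} = 0$. The Fisher information matrix $\mathbf{J}$ can also be expressed as

$$\mathbf{J} = E\left\{[\nabla_0 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})] \cdot [\nabla_0 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})]^T\right\}.$$

Equivalently, we have obtained

$$\mathbf{J} = \mathrm{Cov}\{\mathbf{X}\} - \mathbf{E} = E\{E\{\mathbf{X}|\mathbf{Y}\} \cdot E\{\mathbf{X}^T|\mathbf{Y}\}\} - E\{\mathbf{X}\} \cdot E\{\mathbf{X}^T\},$$

i.e., $\mathbf{J}$ is the covariance matrix of the conditional mean estimator (the subtracted term vanishes when $\mathbf{X}$ has zero mean). Note that $\mathbf{J}$ can also be obtained as the expectation of the Hessian (or, equivalently, as the covariance matrix of the gradient) of the information density [14],

$$i_{\boldsymbol{\lambda}}(\mathbf{x};\mathbf{y}) = \ln[P(\mathbf{y}|\mathbf{x})/P_{\boldsymbol{\lambda}}(\mathbf{y})],$$

which is, again, computed at $\boldsymbol{\lambda} = 0$; the sign follows because $i_{\boldsymbol{\lambda}}$ depends on $\boldsymbol{\lambda}$ only through $-\ln P_{\boldsymbol{\lambda}}(\mathbf{y})$.

$^2$ As a side remark, we also mention the physical perspective: if $Z(\mathbf{y}, \boldsymbol{\lambda})$ is thought of as the partition function of a certain statistical-mechanical model (as discussed in the Introduction), where the components of $\boldsymbol{\lambda}$ are thought of as certain generalized forces or fields acting on the individual particles, then the above relation between the second order derivative of $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ w.r.t. $\lambda_i$ and $\lambda_j$ and the (conditional) covariances between the corresponding state variables, $X_i$ and $X_j$, is known as one of the versions of the fluctuation-dissipation theorem in statistical mechanics [10, p. 32, eq. (2.44)], which relates the linear response of the system (to an infinitesimally small perturbation in its parameters) to its fluctuations in equilibrium.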

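Sometimes it is more convenient to square the first derivative of $\ln Z$ than to take the second derivative. In these cases, the following relationship may be useful:

$$\begin{aligned}
\mathbf{H} &\triangleq E\left\{[\nabla_0 \ln Z(\mathbf{Y},\boldsymbol{\lambda})] \cdot [\nabla_0 \ln Z(\mathbf{Y},\boldsymbol{\lambda})]^T\right\} \\
&= E\left\{[\nabla_0 \ln\{P_{\boldsymbol{\lambda}}(\mathbf{Y}) \cdot \Theta(\boldsymbol{\lambda})\}] \cdot [\nabla_0 \ln\{P_{\boldsymbol{\lambda}}(\mathbf{Y}) \cdot \Theta(\boldsymbol{\lambda})\}]^T\right\} \\
&= E\left\{[\nabla_0 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})] \cdot [\nabla_0 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})]^T\right\} + [\nabla_0 \ln \Theta(\boldsymbol{\lambda})] \cdot [\nabla_0 \ln \Theta(\boldsymbol{\lambda})]^T \\
&= \mathbf{J} + E\{\mathbf{X}\} \cdot E\{\mathbf{X}^T\} \\
&= \mathrm{Cov}\{\mathbf{X}\} + E\{\mathbf{X}\} \cdot E\{\mathbf{X}^T\} - \mathbf{E} \\
&= E\{\mathbf{X}\mathbf{X}^T\} - \mathbf{E},
\end{aligned} \qquad (5)$$

where the cross terms in the third line vanish because $E\{\nabla_0 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})\} = 0$, and so,

$$\mathbf{E} = E\{\mathbf{X}\mathbf{X}^T\} - \mathbf{H}.$$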
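Particularizing these results to the MMSE,

$$\mathrm{mmse}(\mathbf{X}|\mathbf{Y}) = \sum_{i=1}^n E\{(X_i - E\{X_i|\mathbf{Y}\})^2\},$$

which is the trace of $\mathbf{E}$, we have the following relations, which we formulate as a proposition.

Proposition 1. The following formulas for the MMSE hold:

$$\begin{aligned}
\mathrm{mmse}(\mathbf{X}|\mathbf{Y}) &= \sum_{i=1}^n E\left\{\left.\frac{\partial^2 \ln Z(\mathbf{Y},\boldsymbol{\lambda})}{\partial \lambda_i^2}\right|_{\boldsymbol{\lambda}=0}\right\} & (6) \\
&= \sum_{i=1}^n \left[\mathrm{Var}\{X_i\} + E\left\{\left.\frac{\partial^2 \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})}{\partial \lambda_i^2}\right|_{\boldsymbol{\lambda}=0}\right\}\right] & (7) \\
&= \sum_{i=1}^n \left[\mathrm{Var}\{X_i\} - E\left\{\left[\frac{\partial \ln P_{\boldsymbol{\lambda}}(\mathbf{Y})}{\partial \lambda_i}\right]^2_{\boldsymbol{\lambda}=0}\right\}\right] & (8) \\
&= \sum_{i=1}^n \left[E\{X_i^2\} - E\left\{\left[\frac{\partial \ln Z(\mathbf{Y},\boldsymbol{\lambda})}{\partial \lambda_i}\right]^2_{\boldsymbol{\lambda}=0}\right\}\right] & (9)
\end{aligned}$$

In the second and the third formulas, $\ln P_{\boldsymbol{\lambda}}(\mathbf{Y})$ can be replaced by $-i_{\boldsymbol{\lambda}}(\mathbf{X};\mathbf{Y})$, thus relating the MMSE to the information density.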
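The four formulas of Proposition 1 can be checked numerically on a small discrete model. The sketch below is only an illustration under assumed ingredients: the alphabets, the prior, and the channel are hypothetical, and the derivatives at $\boldsymbol{\lambda} = 0$ are approximated by central differences rather than derived in closed form. It uses $\ln P_{\boldsymbol{\lambda}}(\mathbf{y}) = \ln Z(\mathbf{y},\boldsymbol{\lambda}) - \ln\Theta(\boldsymbol{\lambda})$.

```python
import itertools
import math

N = 2
ALPHABET = (-1.0, 1.0, 2.5)                       # hypothetical input alphabet
XS = list(itertools.product(ALPHABET, repeat=N))
YS = (0, 1, 2)                                    # hypothetical output alphabet

_w = {x: math.exp(0.4 * x[0] * x[1] - 0.2 * (x[0] ** 2 + x[1] ** 2)) for x in XS}
PRIOR = {x: w / sum(_w.values()) for x, w in _w.items()}   # hypothetical P(x)

def channel(y, x):
    """Hypothetical channel P(y | x), normalised over YS."""
    w = [math.exp(-(u - 1.0 - 0.5 * (x[0] + x[1])) ** 2) for u in YS]
    return w[YS.index(y)] / sum(w)

def log_z(y, lam):
    return math.log(sum(math.exp(lam[0] * x[0] + lam[1] * x[1]) * PRIOR[x] * channel(y, x)
                        for x in XS))

def log_theta(lam):
    return math.log(sum(math.exp(lam[0] * x[0] + lam[1] * x[1]) * PRIOR[x] for x in XS))

def log_p_lam(y, lam):
    # P_lambda(y) = Z(y, lambda) / Theta(lambda)
    return log_z(y, lam) - log_theta(lam)

def d1(f, i, h=1e-4):
    """Central first difference of f along lambda_i at lambda = 0."""
    e = [h if k == i else 0.0 for k in range(N)]
    return (f(e) - f([-v for v in e])) / (2 * h)

def d2(f, i, h=5e-4):
    """Central second difference of f along lambda_i at lambda = 0."""
    e = [h if k == i else 0.0 for k in range(N)]
    return (f(e) - 2 * f([0.0] * N) + f([-v for v in e])) / (h * h)

def p_y(y):
    return sum(PRIOR[x] * channel(y, x) for x in XS)

def moment(i, power):
    return sum(PRIOR[x] * x[i] ** power for x in XS)

def mmse_formulas():
    """Return the right-hand sides of (6), (7), (8), (9)."""
    f6 = f7 = f8 = f9 = 0.0
    for i in range(N):
        var_i = moment(i, 2) - moment(i, 1) ** 2
        f7 += var_i
        f8 += var_i
        f9 += moment(i, 2)
        for y in YS:
            lz = lambda lam, y=y: log_z(y, lam)
            lp = lambda lam, y=y: log_p_lam(y, lam)
            f6 += p_y(y) * d2(lz, i)
            f7 += p_y(y) * d2(lp, i)
            f8 -= p_y(y) * d1(lp, i) ** 2
            f9 -= p_y(y) * d1(lz, i) ** 2
    return f6, f7, f8, f9

def mmse_direct():
    """Sum of expected conditional variances, computed from the posterior."""
    total = 0.0
    for y in YS:
        post = [PRIOR[x] * channel(y, x) / p_y(y) for x in XS]
        for i in range(N):
            m1 = sum(p * x[i] for p, x in zip(post, XS))
            m2 = sum(p * x[i] ** 2 for p, x in zip(post, XS))
            total += p_y(y) * (m2 - m1 ** 2)
    return total
```

All four values returned by `mmse_formulas()` should agree with `mmse_direct()` up to the finite-difference error.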

3.2 Extension to the Mismatched Case



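In this short subsection, we outline how our approach can easily be extended to handle situations of mismatched estimation. Consider a mismatched estimator which is the conditional mean of $\mathbf{X}$ given $\mathbf{Y}$, based on an incorrect joint distribution $Q(\mathbf{x},\mathbf{y})$, whereas the true joint distribution continues to be $P(\mathbf{x}, \mathbf{y})$. Denoting by $Z_P(\mathbf{y}, \boldsymbol{\lambda})$ and $Z_Q(\mathbf{y}, \boldsymbol{\lambda})$ the corresponding partition functions, and by $E_P$ and $E_Q$ the corresponding expectations, our approach can easily be generalized to handle this case as follows:

$$\begin{aligned}
\mathbf{E} &= E_P\{(\mathbf{X} - E_Q\{\mathbf{X}|\mathbf{Y}\})(\mathbf{X}^T - E_Q\{\mathbf{X}^T|\mathbf{Y}\})\} \\
&= E_P\{\mathbf{X}\mathbf{X}^T\} - E_P\{E_P\{\mathbf{X}|\mathbf{Y}\} E_Q\{\mathbf{X}^T|\mathbf{Y}\}\} - E_P\{E_Q\{\mathbf{X}|\mathbf{Y}\} E_P\{\mathbf{X}^T|\mathbf{Y}\}\} + E_P\{E_Q\{\mathbf{X}|\mathbf{Y}\} E_Q\{\mathbf{X}^T|\mathbf{Y}\}\} \\
&= E_P\{\mathbf{X}\mathbf{X}^T\} - E_P\{[\nabla_0 \ln Z_P(\mathbf{Y}, \boldsymbol{\lambda})] \cdot [\nabla_0 \ln Z_Q(\mathbf{Y}, \boldsymbol{\lambda})]^T\} - E_P\{[\nabla_0 \ln Z_Q(\mathbf{Y},\boldsymbol{\lambda})] \cdot [\nabla_0 \ln Z_P(\mathbf{Y},\boldsymbol{\lambda})]^T\} \\
&\quad + E_P\{[\nabla_0 \ln Z_Q(\mathbf{Y},\boldsymbol{\lambda})] \cdot [\nabla_0 \ln Z_Q(\mathbf{Y}, \boldsymbol{\lambda})]^T\}.
\end{aligned}$$

Thus, in particular, the MSE associated with the mismatched estimator is given by

$$\mathrm{mse}_Q(\mathbf{X}|\mathbf{Y}) = \sum_{i=1}^n \left[E_P\{X_i^2\} - 2E_P\left\{\left.\frac{\partial \ln Z_P(\mathbf{Y},\boldsymbol{\lambda})}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0} \cdot \left.\frac{\partial \ln Z_Q(\mathbf{Y},\boldsymbol{\lambda})}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0}\right\} + E_P\left\{\left[\frac{\partial \ln Z_Q(\mathbf{Y}, \boldsymbol{\lambda})}{\partial \lambda_i}\right]^2_{\boldsymbol{\lambda}=0}\right\}\right]. \qquad (10)$$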
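Formula (10) can be illustrated with a scalar toy model ($n = 1$) in which both the source and the channel assumed by the estimator differ from the true ones. Everything specific in the sketch below (alphabets, the two priors, the channel family and its `gain` parameter) is a hypothetical choice for illustration only; the derivative of each log-partition function at $\lambda = 0$ is approximated by a central difference.

```python
import math

XS = (-1.0, 0.0, 1.0, 3.0)      # hypothetical scalar input alphabet (n = 1)
YS = (-1, 0, 1)                 # hypothetical output alphabet

def normalise(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

P_X = normalise({x: math.exp(-0.5 * x * x) for x in XS})            # true source
Q_X = normalise({x: math.exp(-0.2 * (x - 1.0) ** 2) for x in XS})   # assumed source

def channel(y, x, gain):
    """Hypothetical channel family, normalised over YS."""
    w = {u: math.exp(-(u - gain * x) ** 2) for u in YS}
    return w[y] / sum(w.values())

def joint_p(x, y):
    return P_X[x] * channel(y, x, 0.8)     # true joint P(x, y)

def joint_q(x, y):
    return Q_X[x] * channel(y, x, 0.5)     # mismatched joint Q(x, y)

def dlogz(joint, y, h=1e-4):
    """Central-difference derivative of ln Z(y, lambda) at lambda = 0."""
    def f(lam):
        return math.log(sum(math.exp(lam * x) * joint(x, y) for x in XS))
    return (f(h) - f(-h)) / (2 * h)

def mse_via_partition_functions():
    """Right-hand side of (10) for n = 1."""
    total = sum(P_X[x] * x * x for x in XS)
    for y in YS:
        p_y = sum(joint_p(x, y) for x in XS)
        g_p, g_q = dlogz(joint_p, y), dlogz(joint_q, y)
        total += p_y * (g_q ** 2 - 2.0 * g_p * g_q)
    return total

def mse_direct():
    """E_P{(X - E_Q{X|Y})^2} computed by direct enumeration."""
    total = 0.0
    for y in YS:
        q_y = sum(joint_q(x, y) for x in XS)
        estimate = sum(x * joint_q(x, y) for x in XS) / q_y
        total += sum(joint_p(x, y) * (x - estimate) ** 2 for x in XS)
    return total

def mmse_matched():
    """Matched MMSE, E_P{X^2} - E_P{(d ln Z_P)^2}, for comparison."""
    total = sum(P_X[x] * x * x for x in XS)
    for y in YS:
        total -= sum(joint_p(x, y) for x in XS) * dlogz(joint_p, y) ** 2
    return total
```

The two routes to the mismatched MSE should coincide, and both should be no smaller than the matched MMSE.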



4 Examples 

In this section, we provide three examples, where we show how the log-partition function, In Z(y, A), 
can be evaluated for large n, using methods of statistical mechanics. Using the relations derived in 
Subsection 3.1, we then show how the conditional mean estimator and the MMSE can be approxi- 
mated for large n. 

4.1 Example 1 - A Codeword Transmitted Over an AWGN 

Our first example is taken from [9, Subsection 5.2], but here we demonstrate how to derive the conditional mean estimator and the MMSE using Proposition 1, rather than the I-MMSE relation. For the sake of completeness and convenience, we provide here the full necessary details (with the appropriate modifications to accommodate the method proposed herein), including those that already appear in [9]. As noted in [9], the analysis of this model is intimately related to one of the statistical-mechanical techniques used in the analysis of the so-called random energy model (REM) of disordered magnetic materials, a.k.a. spin glasses in the statistical physics literature (see references in [9]).

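Let $\mathbf{X}$ be chosen uniformly at random from a codebook $\mathcal{C} = \{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{M-1}\}$ of size $M = e^{nR}$. The codebook itself is also selected at random (and then revealed to the estimator) in the following manner: each $\mathbf{x}_i$ is selected independently and uniformly at random from the surface of a sphere of radius $\sqrt{nP_x}$ centered at the origin. The channel $P(\mathbf{y}|\mathbf{x})$ is an AWGN channel (hence $m = n$) whose noise variance is $1/\beta$ (keeping the same notation as in [9]), i.e.,

$$P(\mathbf{y}|\mathbf{x}) = \left(\frac{\beta}{2\pi}\right)^{n/2} \exp\left\{-\frac{\beta}{2}\|\mathbf{y}-\mathbf{x}\|^2\right\}.$$

Thus, for a given $\mathbf{y}$, we have (dropping the factor $(\beta/2\pi)^{n/2}$, which does not depend on $\boldsymbol{\lambda}$):

$$\begin{aligned}
Z(\mathbf{y},\boldsymbol{\lambda}) &= \sum_{\mathbf{x}\in\mathcal{C}} e^{-nR} \exp\{-\beta\|\mathbf{y}-\mathbf{x}\|^2/2 + \boldsymbol{\lambda}^T\mathbf{x}\} \\
&= e^{-nR} \exp[-\beta\|\mathbf{y}-\mathbf{x}_0\|^2/2 + \boldsymbol{\lambda}^T \mathbf{x}_0] + \sum_{\mathbf{x}\in\mathcal{C}\setminus\{\mathbf{x}_0\}} e^{-nR} \exp[-\beta\|\mathbf{y} - \mathbf{x}\|^2/2 + \boldsymbol{\lambda}^T \mathbf{x}] \\
&\triangleq Z_c(\mathbf{y},\boldsymbol{\lambda}) + Z_e(\mathbf{y},\boldsymbol{\lambda}),
\end{aligned} \qquad (11)$$

where, without loss of generality, $\mathbf{x}_0$ designates the transmitted codeword. Now, since $\|\mathbf{y} - \mathbf{x}_0\|^2$ is typically around $n/\beta$, $Z_c(\mathbf{y},\boldsymbol{\lambda})$ would typically be about $e^{-nR} e^{-\beta \cdot n/(2\beta)} e^{\boldsymbol{\lambda}^T \mathbf{x}_0} = e^{-n(R+1/2)+\boldsymbol{\lambda}^T \mathbf{x}_0}$.

As for $Z_e(\mathbf{y},\boldsymbol{\lambda})$, we have:

$$Z_e(\mathbf{y},\boldsymbol{\lambda}) = e^{-nR} \int d\epsilon\, N(\epsilon) e^{-n\beta\epsilon},$$

where $N(\epsilon)$ is the number of codewords $\{\mathbf{x}\}$ in $\mathcal{C} \setminus \{\mathbf{x}_0\}$ for which $\|\mathbf{y} - \mathbf{x}\|^2/2 - \boldsymbol{\lambda}^T \mathbf{x}/\beta \approx n\epsilon$, namely, between $n\epsilon$ and $n(\epsilon + d\epsilon)$. Now, given $\mathbf{y}$, $N(\epsilon) = \sum_{i=1}^{M} 1\{\mathbf{x}_i : \|\mathbf{y} - \mathbf{x}_i\|^2/2 - \boldsymbol{\lambda}^T \mathbf{x}_i/\beta \approx n\epsilon\}$ is the sum of $M$ i.i.d. Bernoulli random variables and so, its expectation is

$$\bar{N}(\epsilon) = \sum_{i=1}^{M} \Pr\{\|\mathbf{y} - \mathbf{X}_i\|^2/2 - \boldsymbol{\lambda}^T \mathbf{X}_i/\beta \approx n\epsilon\} = e^{nR} \Pr\{\|\mathbf{y} - \mathbf{X}_1\|^2/2 - \boldsymbol{\lambda}^T \mathbf{X}_1/\beta \approx n\epsilon\}. \qquad (12)$$

Denoting $P_y = \frac{1}{n}\sum_{i=1}^n y_i^2$ (typically, $P_y$ is about $P_x + 1/\beta$), the event $\|\mathbf{y} - \mathbf{x}\|^2/2 - \boldsymbol{\lambda}^T \mathbf{x}/\beta \approx n\epsilon$ is equivalent to the event $\mathbf{x}^T(\mathbf{y} + \boldsymbol{\lambda}/\beta) \approx [(P_x + P_y)/2 - \epsilon]n$, or equivalently,

$$\rho(\mathbf{x}, \mathbf{y}) \triangleq \frac{\mathbf{x}^T(\mathbf{y} + \boldsymbol{\lambda}/\beta)}{n\sqrt{P_x P_y'}} \approx \frac{(P_x + P_y)/2 - \epsilon}{\sqrt{P_x P_y'}} \triangleq \frac{P_a - \epsilon}{P_g'},$$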
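where we have defined $P_a = (P_x + P_y)/2$ and $P_g' = \sqrt{P_x P_y'}$, where $P_y' = \frac{1}{n}\sum_{i=1}^n (y_i + \lambda_i/\beta)^2$. The probability that a randomly chosen vector $\mathbf{X}$ on the sphere would have an empirical correlation coefficient $\rho$ with a given vector $\mathbf{y}' = \mathbf{y} + \boldsymbol{\lambda}/\beta$ (that is, $\mathbf{X}$ falls within a cone of half angle $\arccos(\rho)$ around $\mathbf{y}'$) is exponentially $\exp[\frac{n}{2} \ln(1 - \rho^2)]$. For convenience, let us define

$$\Gamma(\rho) = \frac{1}{2}\ln(1-\rho^2)$$

so that we can write

$$\Pr\{\|\mathbf{y} - \mathbf{X}_1\|^2/2 - \boldsymbol{\lambda}^T \mathbf{X}_1/\beta \approx n\epsilon\} \doteq \exp\left\{n\Gamma\left(\frac{P_a-\epsilon}{P_g'}\right)\right\}.$$

If $\epsilon$ is such that

$$R + \Gamma\left(\frac{P_a-\epsilon}{P_g'}\right) > 0,$$

then the energy level $\epsilon$ will typically be populated with an exponential number of codewords, concentrated very strongly around its mean

$$\bar{N}(\epsilon) \doteq \exp\left\{n\left[R + \Gamma\left(\frac{P_a-\epsilon}{P_g'}\right)\right]\right\};$$

otherwise (which means that $\bar{N}(\epsilon)$ is exponentially small), the energy level $\epsilon$ will typically not be populated by any codewords. This means that the populated energy levels range between

$$\epsilon_1 = P_a - P_g'\sqrt{1 - e^{-2R}}$$

and

$$\epsilon_2 = P_a + P_g'\sqrt{1 - e^{-2R}},$$

or equivalently, the populated values of $\rho$ range between $-\rho^*$ and $+\rho^*$, where $\rho^* = \sqrt{1 - e^{-2R}}$. By large deviations and saddle-point methods, it follows that for a typical realization of the randomly chosen code, we have

$$\begin{aligned}
Z_e(\mathbf{y},\boldsymbol{\lambda}) &\doteq e^{-nR} \max_{\epsilon\in[\epsilon_1,\epsilon_2]} \exp\left\{n\left[R + \Gamma\left(\frac{P_a-\epsilon}{P_g'}\right) - \beta\epsilon\right]\right\} \\
&= \max_{\epsilon\in[\epsilon_1,\epsilon_2]} \exp\left\{n\left[\Gamma\left(\frac{P_a-\epsilon}{P_g'}\right) - \beta\epsilon\right]\right\} \\
&= \exp\left\{n \max_{|\rho|\le\rho^*} \left[\frac{1}{2}\ln(1 - \rho^2) - \beta(P_a - \rho P_g')\right]\right\}.
\end{aligned} \qquad (13)$$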
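
The derivative of $\frac{1}{2}\ln(1 - \rho^2) + \rho\beta P_g'$ w.r.t. $\rho$ vanishes within $[-1, 1]$ at

$$\rho = \rho_\beta \triangleq \sqrt{1 + \theta^2} - \theta,$$

where

$$\theta \triangleq \frac{1}{2\beta P_g'}.$$

This is the maximizer as long as $\sqrt{1 + \theta^2} - \theta \le \rho^*$, namely, $\theta \ge e^{-2R}/(2\rho^*)$, or equivalently, $\beta \le \rho^* e^{2R}/P_g'$, which for $P_g' = \sqrt{P_x(P_x + 1/\beta)}$ ($\|\boldsymbol{\lambda}\|$ is small) is equivalent to $\beta \le \beta_R \triangleq (e^{2R} - 1)/P_x$. Thus, for the typical code we have

$$Z_e(\mathbf{y},\boldsymbol{\lambda}) \doteq \begin{cases} \exp\left\{n\left[\frac{1}{2}\ln(1 - \rho_\beta^2) - \beta(P_a - \rho_\beta P_g')\right]\right\}, & \beta < \beta_R \\ \exp\{-n[R + \beta(P_a - \rho^* P_g')]\}, & \beta \ge \beta_R. \end{cases}$$

Taking now into account $Z_c(\mathbf{y}, \boldsymbol{\lambda})$, it is easy to see that for $\beta > \beta_R$ (which means $R < C$), $Z_c(\mathbf{y}, \boldsymbol{\lambda})$ dominates $Z_e(\mathbf{y}, \boldsymbol{\lambda})$, whereas for $\beta < \beta_R$ it is the other way around. It follows then that

$$Z(\mathbf{y},\boldsymbol{\lambda}) \doteq \begin{cases} \exp\left\{n\left[\frac{1}{2}\ln(1 - \rho_\beta^2) - \beta(P_a - \rho_\beta P_g')\right]\right\}, & \beta < \beta_R \\ \exp\{-n(R + \frac{1}{2}) + \boldsymbol{\lambda}^T \mathbf{x}_0\}, & \beta > \beta_R. \end{cases}$$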

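A very similar analysis applies also to the derivative of $\ln Z(\mathbf{y}, \boldsymbol{\lambda})$ w.r.t. $\lambda_i$, which is essentially a weighted average of the $i$-th codeword components $x_i$, with weights proportional to $N(\epsilon)e^{-n\beta\epsilon}$ for all $\epsilon \in [\epsilon_1, \epsilon_2]$. Thus, the exponentially dominant weight is due to the term that maximizes the exponent. Assuming that the correct codeword $\mathbf{x}_0$ is dominant ($Z_c \gg Z_e$, which is the case when $R < C$), this weighted average is obviously dominated by the $i$-th component of $\mathbf{x}_0$, in which case the MMSE essentially vanishes. Otherwise, for $R > C$, $Z_e$ dominates the partition function and the weighted average is overwhelmingly dominated by the term corresponding to the maximizing $\epsilon$, or equivalently, the maximizing $\rho$, which is $\rho_\beta$. This means that the conditional mean estimator of $X_i$ is approximately given by:

$$\begin{aligned}
E\{X_i|\mathbf{y}\} &\approx n \left.\frac{\partial}{\partial \lambda_i}\left[\frac{1}{2}\ln(1 - \rho_\beta^2) + \beta\rho_\beta P_g'\right]\right|_{\boldsymbol{\lambda}=0} \\
&= n\left[\left(\beta P_g' - \frac{\rho_\beta}{1-\rho_\beta^2}\right)\frac{\partial \rho_\beta}{\partial \lambda_i} + \beta\rho_\beta \frac{\partial P_g'}{\partial \lambda_i}\right]_{\boldsymbol{\lambda}=0} \\
&= \beta\rho_\beta n \cdot \left.\frac{\partial P_g'}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0} \\
&= \beta\rho_\beta n \cdot \frac{\sqrt{P_x}}{2\sqrt{P_y}} \cdot \frac{2y_i}{\beta n} \\
&= \frac{P_x}{P_x + 1/\beta} \cdot y_i,
\end{aligned} \qquad (14)$$

where the term multiplying $\partial\rho_\beta/\partial\lambda_i$ vanishes because $\rho_\beta$ is a stationary point of the exponent, and where in the last step we have used $P_y \approx P_x + 1/\beta$ and the identity $\rho_\beta = \sqrt{P_x/(P_x + 1/\beta)}$, which can easily be verified. This is simply the linear Wiener estimator that would have been applied had the input been zero-mean, i.i.d. Gaussian with variance $P_x$, observed in Gaussian noise of variance $1/\beta$ (see also [9]). According to Proposition 1, the MMSE associated with $X_i$ is given by

$$E\{(X_i - E\{X_i|\mathbf{Y}\})^2\} \approx P_x - E\{E^2(X_i|\mathbf{Y})\} = P_x - \left(\frac{P_x}{P_x + 1/\beta}\right)^2 \cdot (P_x + 1/\beta) = \frac{P_x}{1 + \beta P_x},$$

as expected.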
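The closed-form quantities of this example are easy to evaluate numerically. The following sketch is an illustration only: the function name and the returned dictionary keys are hypothetical, and $P_g'$ is evaluated at $\boldsymbol{\lambda} = 0$ with the typical value $P_y = P_x + 1/\beta$. It computes $\rho_\beta$, $\rho^*$, $\beta_R$, the Wiener-type gain of (14), and the resulting per-component MMSE, which is meaningful only in the regime $R > C$.

```python
import math

def example1_summary(p_x, beta, rate):
    """Evaluate the saddle-point quantities of Example 1 at lambda = 0."""
    p_y = p_x + 1.0 / beta                     # typical output power
    p_g = math.sqrt(p_x * p_y)                 # P'_g at lambda = 0
    theta = 1.0 / (2.0 * beta * p_g)
    rho_beta = math.sqrt(1.0 + theta ** 2) - theta      # stationary point of the exponent
    rho_star = math.sqrt(1.0 - math.exp(-2.0 * rate))   # edge of the populated range
    beta_r = (math.exp(2.0 * rate) - 1.0) / p_x         # critical inverse noise variance
    capacity = 0.5 * math.log(1.0 + beta * p_x)         # AWGN capacity in nats
    gain = rho_beta * math.sqrt(p_x / p_y)              # coefficient of y_i in (14)
    mmse = p_x - gain ** 2 * p_y                        # P_x - E{E^2(X_i|Y)}
    return {
        "rho_beta": rho_beta,
        "rho_star": rho_star,
        "beta_r": beta_r,
        "capacity": capacity,
        "gain": gain,
        "mmse_above_capacity": mmse,           # valid only when rate > capacity
        "rate_above_capacity": rate > capacity,
    }

# Example usage with arbitrary illustrative numbers.
summary = example1_summary(p_x=2.0, beta=0.5, rate=0.6)
```

In this sketch the condition `rate > capacity` coincides with `beta < beta_r`, which is the regime where $Z_e$ dominates and the estimator takes the Wiener form.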

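4.2 Example 2 - The Curie-Weiss Model

Consider a binary source

$$P(\mathbf{x}) = C_n \exp\left\{\frac{a}{2n}\left(\sum_{i=1}^n x_i\right)^2 + b\sum_{i=1}^n x_i\right\}, \qquad \mathbf{x} \in \{-1,+1\}^n,$$

where $a$ and $b$ are parameters and $C_n$ is a normalization constant, which is immaterial for our purposes (as it is going to disappear upon taking derivatives w.r.t. $\{\lambda_i\}$, and the same comment applies to the constants $C_n'$ and $C_n''$ below). Let the channel be binary and symmetric, i.e.,

$$P(\mathbf{y}|\mathbf{x}) \propto \exp\left\{\beta\sum_{i=1}^n x_i y_i\right\}, \qquad \mathbf{y} \in \{-1,+1\}^n.$$

Then, the partition function $Z(\mathbf{y}, \boldsymbol{\lambda})$ can be represented as a one-dimensional integral using the Hubbard-Stratonovich transform, which in turn can be assessed using saddle-point methods, as is frequently done in the statistical physics literature. Specifically, we have the following:

$$\begin{aligned}
Z(\mathbf{y},\boldsymbol{\lambda}) &= C_n' \sum_{\mathbf{x}} \exp\left\{\frac{a}{2n}\left(\sum_{i=1}^n x_i\right)^2 + \beta\sum_{i=1}^n x_i y_i + \sum_{i=1}^n \lambda_i x_i + b\sum_{i=1}^n x_i\right\} \\
&= C_n' \sum_{\mathbf{x}} \exp\left\{\sum_{i=1}^n x_i(\beta y_i + \lambda_i + b) + \frac{a}{2n}\left(\sum_{i=1}^n x_i\right)^2\right\} & (16) \\
&= C_n' \sqrt{\frac{n}{2\pi a}} \sum_{\mathbf{x}} \exp\left\{\sum_{i=1}^n x_i(\beta y_i + \lambda_i + b)\right\} \cdot \int_{-\infty}^{+\infty} d\theta \exp\left\{-\frac{n\theta^2}{2a} + \theta\sum_{i=1}^n x_i\right\} & (17) \\
&= C_n'' \int_{-\infty}^{+\infty} d\theta\, e^{-n\theta^2/(2a)} \sum_{\mathbf{x}} \exp\left\{\sum_{i=1}^n x_i(\beta y_i + \lambda_i + b + \theta)\right\} & (18) \\
&= C_n'' \int_{-\infty}^{+\infty} d\theta\, e^{-n\theta^2/(2a)} \prod_{i=1}^n [2\cosh(\beta y_i + \lambda_i + b + \theta)] & (19) \\
&= 2^n C_n'' \int_{-\infty}^{+\infty} d\theta \exp\left\{-\frac{n\theta^2}{2a} + \sum_{i=1}^n \ln\cosh(\beta y_i + \lambda_i + b + \theta)\right\}. & (20)
\end{aligned}$$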

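Thus,

$$\begin{aligned}
\frac{\partial \ln Z(\mathbf{y},\boldsymbol{\lambda})}{\partial \lambda_i} &= \frac{\int_{-\infty}^{+\infty} d\theta \tanh(\beta y_i + \lambda_i + b + \theta) \exp\left\{-\frac{n\theta^2}{2a} + \sum_{j=1}^n \ln\cosh(\beta y_j + \lambda_j + b + \theta)\right\}}{\int_{-\infty}^{+\infty} d\theta \exp\left\{-\frac{n\theta^2}{2a} + \sum_{j=1}^n \ln\cosh(\beta y_j + \lambda_j + b + \theta)\right\}} \\
&\approx \tanh(\beta y_i + \lambda_i + b + \theta^*),
\end{aligned} \qquad (21)$$

where $\theta^*$ is the maximizer of the expression in the exponent, i.e., it is the solution to the zero-derivative equation:

$$\theta = \frac{a}{n}\sum_{i=1}^n \tanh(\beta y_i + \lambda_i + b + \theta).$$

Thus, the MMSE estimator is:

$$\begin{aligned}
E\{X_i|\mathbf{y}\} &= \frac{\int_{-\infty}^{+\infty} d\theta \tanh(\beta y_i + b + \theta) \exp\left\{-\frac{n\theta^2}{2a} + \sum_{j=1}^n \ln\cosh(\beta y_j + b + \theta)\right\}}{\int_{-\infty}^{+\infty} d\theta \exp\left\{-\frac{n\theta^2}{2a} + \sum_{j=1}^n \ln\cosh(\beta y_j + b + \theta)\right\}} & (22) \\
&\approx \tanh(\beta y_i + b + \theta^*), & (23)
\end{aligned}$$

where now $\theta^*$ is understood to be taken with $\boldsymbol{\lambda} = 0$. For $b \ne 0$, the asymptotic MMSE (normalized per component) is then given by

$$\lim_{n\to\infty} \frac{\mathrm{mmse}(\mathbf{X}|\mathbf{Y})}{n} = 1 - E\{\tanh^2(\beta Y + b + \theta_0)\},$$

where $\theta_0$ is the solution to the equation

$$\theta = a \cdot E\{\tanh(\beta Y + b + \theta)\},$$

and where $Y$ is a binary $\{\pm 1\}$ RV with mean $m^*\tanh(\beta)$, $m^*$ being the dominant solution to the equation $m = \tanh(am + b)$, i.e., the maximizer of $h_2((1 + m)/2) + am^2/2 + bm$, where $h_2(\cdot)$ is the binary entropy function. When $b = 0$, $\theta_0$ becomes a random variable which takes on, with equal probabilities, one of two values, each one being the solution to the above displayed equation, except that in one of them $Y$ has mean $m^*\tanh(\beta)$ and in the other, its mean is $-m^*\tanh(\beta)$.

This calculation is intimately related to the Curie-Weiss model of magnetic spins [10, Subsection 2.5.2, pp. 40-44], where the parameter $m$ plays the role of magnetization.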
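For small $n$, the integral representation (22) can be checked against brute-force enumeration, and the saddle-point rule (23) can be evaluated alongside it. The sketch below assumes the source exponent $\frac{a}{2n}(\sum_i x_i)^2 + b\sum_i x_i$ and the channel exponent $\beta\sum_i x_i y_i$ implied by (16); the function names, the integration window, and the parameter values in any usage are hypothetical. The fixed-point iteration for $\theta^*$ is a simple illustrative solver that is guaranteed to converge only when $a < 1$; for larger $a$ the equation may have several solutions and the dominant one would have to be selected.

```python
import itertools
import math

def exact_mean(y, i, a, b, beta):
    """E{X_i | y} by brute force over all 2^n binary vectors."""
    n = len(y)
    num = den = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        s = sum(x)
        field = sum(xj * (beta * yj + b) for xj, yj in zip(x, y))
        w = math.exp(field + a * s * s / (2 * n))
        num += x[i] * w
        den += w
    return num / den

def integral_mean(y, i, a, b, beta, half_width=12.0, steps=6000):
    """Ratio of the two theta-integrals in (22), evaluated by a Riemann sum."""
    n = len(y)
    h = 2 * half_width / steps
    num = den = 0.0
    for k in range(steps + 1):
        th = -half_width + k * h
        expo = -n * th * th / (2 * a)
        expo += sum(math.log(math.cosh(beta * yj + b + th)) for yj in y)
        w = math.exp(expo)
        num += math.tanh(beta * y[i] + b + th) * w
        den += w
    return num / den

def saddle_theta(y, a, b, beta, iters=500):
    """Fixed-point iteration for theta = (a/n) sum_i tanh(beta*y_i + b + theta)."""
    n = len(y)
    th = 0.0
    for _ in range(iters):
        th = a / n * sum(math.tanh(beta * yj + b + th) for yj in y)
    return th

def saddle_mean(y, i, a, b, beta):
    """Saddle-point approximation (23) of E{X_i | y}."""
    return math.tanh(beta * y[i] + b + saddle_theta(y, a, b, beta))
```

The first two functions should agree to numerical precision, since the Hubbard-Stratonovich step is an exact identity; `saddle_mean` is only an approximation whose accuracy improves as $n$ grows.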
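
4.3 Example 3 - The Generalized Multivariate Cauchy Noise Model

Let $X_i \sim \mathcal{N}(0, \sigma^2)$ be i.i.d. RVs, and let the additive noise have a generalized multivariate Cauchy distribution, i.e.,

$$P(\mathbf{y}|\mathbf{x}) = \frac{C_{n,k}}{[1 + (\mathbf{y}-\mathbf{x})^T S(\mathbf{y}-\mathbf{x})]^k},$$

where $C_{n,k}$ is a normalization constant, $S$ is a positive definite matrix, and $k > 0$ is chosen large enough (as a function of $n$) such that $\int_{\mathbb{R}^n} d\mathbf{z}/[1 + \mathbf{z}^T S\mathbf{z}]^k < \infty$, i.e., $k > n/2$. The choice $k = (n + 1)/2$ corresponds to the ordinary multivariate Cauchy distribution. Here, however, we will require moreover that $k$ is large enough such that the second moments exist, i.e., $\int_{\mathbb{R}^n} d\mathbf{z} \cdot \mathbf{z}^T \mathbf{z}/[1 + \mathbf{z}^T S\mathbf{z}]^k < \infty$, which means $k > n/2 + 1$. For simplicity, we will take $S$ to be the identity matrix. However, our analysis easily extends to a general positive definite matrix $S$, as well as to a general Gaussian vector $\mathbf{X}$, not necessarily with i.i.d. components. Using the Laplace transform identity $\int_0^\infty dt \cdot t^{k-1} e^{-st} = \Gamma(k)/s^k$, we have:

$$\begin{aligned}
Z(\mathbf{y},\boldsymbol{\lambda}) &= \int_{\mathbb{R}^n} d\mathbf{x}\, P(\mathbf{x}) e^{\boldsymbol{\lambda}^T\mathbf{x}} \cdot \frac{C_{n,k}}{[1 + \|\mathbf{y}-\mathbf{x}\|^2]^k} & (24) \\
&= C_{n,k} \int_{\mathbb{R}^n} d\mathbf{x}\, P(\mathbf{x}) e^{\boldsymbol{\lambda}^T\mathbf{x}} \int_0^\infty dt\, \frac{t^{k-1}}{\Gamma(k)} \cdot e^{-t(1 + \|\mathbf{y}-\mathbf{x}\|^2)} & (25) \\
&= C_{n,k}' \int_0^\infty dt \cdot t^{k-1} e^{-t} \int_{\mathbb{R}^n} d\mathbf{x}\, P(\mathbf{x}) e^{\boldsymbol{\lambda}^T\mathbf{x}} \cdot e^{-t\|\mathbf{y}-\mathbf{x}\|^2} & (26) \\
&= C_{n,k}'' \int_0^\infty dt \cdot t^{k-1} e^{-t} \prod_{i=1}^n \int_{\mathbb{R}} dx_i\, e^{-x_i^2/(2\sigma^2)} e^{\lambda_i x_i} \cdot e^{-t(y_i - x_i)^2} & (27)
\end{aligned}$$

and so,

$$\left.\frac{\partial \ln Z(\mathbf{y},\boldsymbol{\lambda})}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0} = \frac{\int_0^\infty dt \cdot \frac{t y_i}{t + 1/(2\sigma^2)}\, e^{-t} t^{k-1} \exp\left\{-\frac{n}{2}\ln(1 + 2t\sigma^2) - \frac{t}{1+2t\sigma^2}\sum_j y_j^2\right\}}{\int_0^\infty dt\, e^{-t} t^{k-1} \exp\left\{-\frac{n}{2}\ln(1 + 2t\sigma^2) - \frac{t}{1+2t\sigma^2}\sum_j y_j^2\right\}},$$

which can be approximated by $\hat{t} y_i/(\hat{t} + 1/(2\sigma^2))$, where $\hat{t}$ is the value of $t$ that dominates the integral, i.e.,

$$\hat{t} = \arg\max_t \left[(k - 1)\ln t - \frac{n}{2}\ln(1 + 2t\sigma^2) - t - \frac{t}{1+2t\sigma^2}\sum_i y_i^2\right].$$

The derivation of the MMSE can be done in a similar manner as in the previous examples.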
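To make the last two displays concrete, the sketch below evaluates the ratio of one-dimensional integrals over $t$ numerically and compares it with the saddle-point shortcut based on $\hat{t}$. It is a rough illustration, not a tuned implementation: the helper names, the midpoint grid, and the truncation point `t_max` are assumptions, $S$ is taken to be the identity, and $\hat{t}$ is found by a plain grid search. Both functions return the shrinkage coefficient that multiplies $y_i$ in the estimate of $X_i$.

```python
import math

def log_weight(t, y, k, sigma2):
    """Exponent of the t-integrand: (k-1) ln t - t - (n/2) ln(1+2 t s2) - t*sum(y^2)/(1+2 t s2)."""
    n = len(y)
    energy = sum(v * v for v in y)
    return ((k - 1) * math.log(t) - t
            - 0.5 * n * math.log(1 + 2 * t * sigma2)
            - t * energy / (1 + 2 * t * sigma2))

def shrinkage(t, sigma2):
    """Per-coordinate posterior-mean coefficient t / (t + 1/(2 sigma^2)) for a fixed t."""
    return t / (t + 1 / (2 * sigma2))

def _grid(t_max, steps):
    # Midpoint grid on (0, t_max); avoids evaluating ln t at t = 0.
    h = t_max / steps
    return [(j + 0.5) * h for j in range(steps)]

def integral_gain(y, k, sigma2, t_max=80.0, steps=20000):
    """Coefficient of y_i obtained from the ratio of the two t-integrals."""
    ts = _grid(t_max, steps)
    logs = [log_weight(t, y, k, sigma2) for t in ts]
    top = max(logs)                                   # subtract the peak for stability
    ws = [math.exp(v - top) for v in logs]
    return sum(shrinkage(t, sigma2) * w for t, w in zip(ts, ws)) / sum(ws)

def saddle_gain(y, k, sigma2, t_max=80.0, steps=20000):
    """Coefficient of y_i from the dominant value t_hat, plus t_hat itself."""
    t_hat = max(_grid(t_max, steps), key=lambda t: log_weight(t, y, k, sigma2))
    return shrinkage(t_hat, sigma2), t_hat

# Example usage with arbitrary illustrative data (n = 6, so k must exceed n/2 + 1 = 4).
y_obs = (0.8, -1.5, 2.0, 0.3, -0.7, 1.1)
exact_coeff = integral_gain(y_obs, k=5, sigma2=1.5)
approx_coeff, t_hat = saddle_gain(y_obs, k=5, sigma2=1.5)
```

Since the coefficient $t/(t + 1/(2\sigma^2))$ lies strictly between 0 and 1 for every $t > 0$, both estimates shrink $y_i$ toward zero, as one would expect for a zero-mean Gaussian input.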

5 Joint Distributions with Generalized Spherical Symmetry

Examples 2 and 3 of the previous section have one idea in common. In both of them, we expressed either the source or the channel as a one-dimensional integral over a variable ($t$ or $\theta$, in those examples), where for each value of this variable, we have a product-form measure, which enables us, after applying saddle-point analysis to this integral, to pass to a closed-form formula that has the flavor of a single-letter characterization. In this section, we generalize this idea to establish a somewhat more general framework.

Suppose that $m = n$ and the joint distribution of $\mathbf{X}$ and $\mathbf{Y}$ is of the form

$$P(\mathbf{x},\mathbf{y}) = F_n\left(\sum_i \phi(x_i, y_i)\right).$$

Let $f_n(t)$ be the inverse Laplace transform of $F_n(s)$. Then, we have

$$\begin{aligned}
Z(\mathbf{y},\boldsymbol{\lambda}) &= \int_{\mathbb{R}^n} d\mathbf{x}\, e^{\boldsymbol{\lambda}^T\mathbf{x}} P(\mathbf{x},\mathbf{y}) \\
&= \int_{\mathbb{R}^n} d\mathbf{x}\, e^{\boldsymbol{\lambda}^T\mathbf{x}} \int_0^\infty dt\, f_n(t) \exp\left\{-t\sum_i \phi(x_i, y_i)\right\} \\
&= \int_0^\infty dt\, f_n(t) \int_{\mathbb{R}^n} d\mathbf{x}\, e^{\boldsymbol{\lambda}^T\mathbf{x}} \exp\left\{-t\sum_i \phi(x_i, y_i)\right\} \\
&= \int_0^\infty dt\, f_n(t) \prod_{i=1}^n \int_{\mathbb{R}} dx_i\, e^{\lambda_i x_i} \exp\{-t\phi(x_i, y_i)\}.
\end{aligned} \qquad (29)$$

Before proceeding, we should note that by using the Laplace transform, we have essentially represented the joint distribution of $\mathbf{X}$ and $\mathbf{Y}$ as a mixture of product-form measures, indexed by $t$, each being proportional to $\exp\{-t\sum_i \phi(x_i, y_i)\}$. If we normalize these measures by $Z_t^n = \left[\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} \exp\{-t\phi(x, y)\}\right]^n$, and define the i.i.d. probability distribution

$$P(\mathbf{x},\mathbf{y}|t) = \frac{\exp\{-t\sum_i \phi(x_i, y_i)\}}{Z_t^n},$$

then $P(\mathbf{x},\mathbf{y})$ is essentially expressed here as a mixture of i.i.d. probability functions $\{P(\mathbf{x},\mathbf{y}|t)\}$, where $t$ can be thought of as a random parameter whose prior is given by $w_n(t) = f_n(t) Z_t^n$. However, it should be kept in mind that this integral representation goes somewhat further than being a mixture of i.i.d. distributions because $f_n(t)$, and hence also $w_n(t)$, may be negative for some ranges of $t$ even when $F_n(s)$ is strictly positive for all $s$. For example, recall that the inverse Laplace transform of $\alpha/(s^2 + \alpha^2)$ is $\sin(\alpha t)$ ($t \ge 0$), and so, for $F_n(s) = \alpha^n/(s^2 + \alpha^2)^n$, $f_n(t)$ is given by the $n$-fold convolution of $\sin(\alpha t)$ with itself. In such cases, $P(\mathbf{x}, \mathbf{y})$ cannot be considered a mixture of i.i.d. distributions.



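Let us now denote

$$\rho(\lambda, y, t) = \ln\left[\int_{\mathbb{R}} dx\, e^{\lambda x - t\phi(x,y)}\right], \qquad \rho_0(y,t) = \rho(0,y,t) = \ln\left[\int_{\mathbb{R}} dx\, e^{-t\phi(x,y)}\right],$$

and

$$\zeta(y,t) = \left.\frac{\partial \rho(\lambda,y,t)}{\partial \lambda}\right|_{\lambda=0} = \frac{\int_{\mathbb{R}} dx \cdot x\, e^{-t\phi(x,y)}}{\int_{\mathbb{R}} dx \cdot e^{-t\phi(x,y)}}.$$

Then,

$$Z(\mathbf{y},\boldsymbol{\lambda}) = \int_0^\infty dt\, f_n(t)\, e^{\sum_i \rho(\lambda_i, y_i, t)},$$

and so,

$$E\{X_i|\mathbf{y}\} = \left.\frac{\partial \ln Z(\mathbf{y},\boldsymbol{\lambda})}{\partial \lambda_i}\right|_{\boldsymbol{\lambda}=0} = \frac{\int_0^\infty dt\, f_n(t)\, \zeta(y_i,t)\, e^{\sum_j \rho_0(y_j,t)}}{\int_0^\infty dt\, f_n(t)\, e^{\sum_j \rho_0(y_j,t)}},$$

which is approximated by $\zeta(y_i, \hat{t})$, where $\hat{t}$ is the maximizer of the expression

$$\ln|f_n(t)| + \sum_i \rho_0(y_i, t).$$

The MMSE of estimating $X_i$ is given by

$$\mathrm{mmse}(X_i|\mathbf{Y}) \approx E\{X_i^2\} - E\{\zeta^2(Y_i, t_0(t))\},$$

where the second term is computed as follows:

$$E\{\zeta^2(Y_i, t_0(t))\} = \int_0^\infty dt\, w_n(t)\, E\{\zeta^2(Y_i, t_0(t))|t\},$$

with the inner expectation being

$$E\{\zeta^2(Y_i, t_0(t))|t\} = \frac{\int dy\, \zeta^2(y, t_0(t))\, e^{\rho_0(y,t)}}{\int dy\, e^{\rho_0(y,t)}},$$

and with $t_0(t)$ being the value of $t'$ that maximizes

$$\ln|f_n(t')| + n \cdot \frac{\int dy\, e^{\rho_0(y,t)}\, \rho_0(y,t')}{\int dy\, e^{\rho_0(y,t)}}.$$

Thus, we have characterized both the conditional mean estimator and the MMSE in the spirit of a single-letter formula for this class of joint distributions.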

The following further extensions of this formalism are conceptually straightforward: 
1. The range of the variable t need not be [0, ∞). The above analysis applies to
any range, as long as the integrals exist.

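2. The joint distribution P(x,y) may be a function of more than one statistic of the form $\sum_i \phi(x_i, y_i)$, i.e.,

$$P(\boldsymbol{x}, \boldsymbol{y}) = F\left(\sum_{i=1}^n \phi_1(x_i, y_i), \ldots, \sum_{i=1}^n \phi_k(x_i, y_i)\right).$$

In this case, one may apply a Laplace transform of a higher dimension

$$F(s_1, \ldots, s_k) = \int_0^\infty \cdots \int_0^\infty dt_1 \cdots dt_k\, f(t_1, \ldots, t_k)\, e^{-t_1 s_1 - \cdots - t_k s_k},$$

where $s_j = \sum_i \phi_j(x_i, y_i)$, $j = 1, 2, \ldots, k$.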

3. The assumption that the i-th term of $\sum_i \phi(x_i, y_i)$ depends only on the i-th coordinate of
y is not really necessary. The derivation continues to hold, for example, if we allow more
generally the form $\sum_i \phi(x_i, y_i, y_{i-1}, \ldots, y_{i-k})$.

4. The case where $\phi$ is a quadratic form can be extended to allow a quadratic form that involves
all coordinates of x and y collectively, using a positive definite matrix S for weighting. In
other words, joint distributions with elliptic symmetry are allowed, with the form
$P(\boldsymbol{x}, \boldsymbol{y}) = F_n[(\boldsymbol{x}, \boldsymbol{y})^T S (\boldsymbol{x}, \boldsymbol{y})]$, where (x, y) denotes the concatenated column vector of dimension (n + m)
formed by x and y, and the matrix S is of dimension (n + m) × (n + m). In this case, the
kernel is Gaussian and hence the estimator is linear for a given t.
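
For the elliptically symmetric case in item 4, the per-t estimator can be written out by completing the square. This is a sketch under the assumption (not stated above) that S is symmetric and partitioned into blocks $S_{xx}$ (n × n), $S_{xy}$ (n × m), and $S_{yy}$ (m × m); the block names are introduced here only for illustration.

```latex
\begin{align*}
(x, y)^T S (x, y)
  &= x^T S_{xx} x + 2 x^T S_{xy} y + y^T S_{yy} y \\
  &= \big(x + S_{xx}^{-1} S_{xy} y\big)^T S_{xx} \big(x + S_{xx}^{-1} S_{xy} y\big)
     + y^T \big(S_{yy} - S_{xy}^T S_{xx}^{-1} S_{xy}\big) y .
\end{align*}
% Hence, for fixed t, the kernel exp{-t (x,y)^T S (x,y)}, viewed as a function of x,
% is Gaussian with
\begin{align*}
\text{mean} = -S_{xx}^{-1} S_{xy}\, y, \qquad \text{covariance} = (2 t S_{xx})^{-1} .
\end{align*}
```

Under these assumptions the per-t conditional mean is linear in y and does not depend on t; only the conditional covariance scales as 1/t.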

6 Conclusion 

In this paper, we have proposed a simple relation between MMSE estimation measures and a certain 
expression, which can be viewed as a partition function, and hence be analyzed using methods of 
statistical mechanics. This partition function is also related to several information measures, like 
the information density and the Fisher information. The proposed approach has several advantages 
over the I-MMSE relation and its variants: 

1. It is conceptually simple and direct. 

2. It applies in full generality, for every joint distribution of the desired random vector X and 
its noisy observation vector Y. 
3. It provides not only the MMSE error covariance matrix, but also the conditional mean
estimator itself, $\hat{\boldsymbol{x}} = E\{\boldsymbol{X}|\boldsymbol{y}\}$.

4. It offers several alternative expressions of the MMSE (see Proposition 1). 

5. The approach is easy to extend to the mismatched case and it allows mismatch, not only in 
the marginal of X, but in the entire joint density P(x,y). 

Finally, considering earlier work on the I-MMSE relation and its variants that were discussed
in the Introduction, it would be natural to seek relations between MMSE estimation and the Hessian
of the mutual information. One can show, using the same techniques as in Subsection 3.1, that the
following relation holds:

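$$E = \nabla^2 I_\lambda(X; Y)\big|_{\lambda=0} + \mathrm{Cov}\{X\} - \mathrm{Cov}\left\{(X - E\{X\})(X - E\{X\})^T,\ \ln\frac{P(Y|X)}{P(Y)}\right\},$$

where $I_\lambda(X; Y)$ is the mutual information induced by the joint distribution

$$P_\lambda(x, y) = \frac{e^{\lambda^T x} P(x, y)}{\Theta(\lambda)}.$$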

Unfortunately, this relation seems somewhat more complicated and not as useful as the I-MMSE 
relation of [5] or the relations proposed in Subsection 3.1 herein. 
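
The relation above can be sanity-checked numerically in the scalar case. The sketch below assumes that the λ-dependent joint distribution is the exponentially tilted law P_λ(x, y) ∝ e^{λx} P(x, y), that the Hessian is evaluated at λ = 0, and that the logarithm inside the covariance is the information density ln[P(y|x)/P(y)]. The pmf values, the function names, and the finite-difference step are hypothetical choices made for illustration.

```python
import math

# Hypothetical joint pmf P(x, y) on a small alphabet (illustrative numbers only).
P = {(0, 0): 0.20, (0, 1): 0.10,
     (1, 0): 0.05, (1, 1): 0.25,
     (2, 0): 0.10, (2, 1): 0.30}


def marginals(Q):
    qx, qy = {}, {}
    for (x, y), q in Q.items():
        qx[x] = qx.get(x, 0.0) + q
        qy[y] = qy.get(y, 0.0) + q
    return qx, qy


def tilted(lam):
    # Assumed tilted law: P_lam(x, y) proportional to exp(lam * x) * P(x, y).
    w = {(x, y): p * math.exp(lam * x) for (x, y), p in P.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}


def mutual_information(Q):
    qx, qy = marginals(Q)
    return sum(q * math.log(q / (qx[x] * qy[y])) for (x, y), q in Q.items())


def expect(g):
    # Expectation under the untilted pmf P.
    return sum(p * g(x, y) for (x, y), p in P.items())


px, py = marginals(P)
mean_x = expect(lambda x, y: x)
var_x = expect(lambda x, y: (x - mean_x) ** 2)

# Left-hand side: MMSE computed directly from the conditional mean E{X|y}.
cond_mean = {y0: sum(p * x for (x, y), p in P.items() if y == y0) / py[y0]
             for y0 in py}
mmse_direct = expect(lambda x, y: (x - cond_mean[y]) ** 2)


def sq_dev(x, y):
    return (x - mean_x) ** 2


def info_density(x, y):
    return math.log(P[(x, y)] / (px[x] * py[y]))


# Covariance between (X - E{X})^2 and the information density.
cov_term = (expect(lambda x, y: sq_dev(x, y) * info_density(x, y))
            - expect(sq_dev) * expect(info_density))

# Second derivative of I_lam(X;Y) at lam = 0 by a central finite difference.
h = 1e-3
hessian = (mutual_information(tilted(h)) - 2.0 * mutual_information(tilted(0.0))
           + mutual_information(tilted(-h))) / h ** 2

mmse_from_relation = hessian + var_x - cov_term
print(mmse_direct, mmse_from_relation)
```

The two printed values should agree up to the finite-difference error, which for this step size is far below the MMSE itself.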






References 

[1] R. S. Bucy, "Information and filtering," Information Sciences, vol. 18, pp. 179-187, 1979. 

[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons,
Hoboken, NJ, U.S.A., 2006.

[3] T. E. Duncan, "On the calculation of mutual information," SIAM Journal on Applied
Mathematics, vol. 19, no. 1, pp. 215-220, 1970.

[4] D. Guo, "Relative entropy and score function: new information-estimation relationships 
through arbitrary additive perturbations," Proc. ISIT 2009, Seoul, South Korea, June-July 
2009. 

[5] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in
Gaussian channels," IEEE Trans. Inform. Theory, vol. 51, no. 4, pp. 1261-1282, April 2005.

[6] D. Guo, S. Shamai, and S. Verdú, "Additive non-Gaussian noise channels: mutual
information and conditional mean estimation," Proc. 2005 IEEE Int. Symp. on Inform. Theory (ISIT
2005), pp. 719-723, Adelaide, Australia, September 2005.

[7] D. Guo, S. Shamai, and S. Verdú, "Mutual information and conditional mean estimation in
Poisson channels," IEEE Trans. Inform. Theory, vol. 54, no. 5, pp. 1837-1849, May 2008.

[8] T. Kailath, "The innovations approach to detection and estimation theory," Proc. of the 
IEEE, vol. 58, no. 5, pp. 680-695, May 1970. 

[9] N. Merhav, D. Guo, and S. Shamai (Shitz), "Statistical physics of signal estimation in
Gaussian noise: theory and examples of phase transitions," to appear in IEEE Trans. Inform.
Theory, March 2010.

[10] M. Mézard and A. Montanari, Information, Physics, and Computation, Oxford University
Press, 2009.

[11] D. P. Palomar and S. Verdú, "Gradient of mutual information in linear vector Gaussian
channels," IEEE Trans. Inform. Theory, vol. 52, no. 1, pp. 141-154, January 2006.






[12] M. Raginsky and T. P. Coleman, "Mutual information and posterior estimates in channels 
of exponential family type," Proc. 2009 IEEE Workshop on Inform. Theory, pp. 399-403, 
Taormina, Sicily, October 2009. 

[13] S. Verdú, "Mismatched estimation and relative entropy," Proc. ISIT 2009, Seoul, South
Korea, June-July 2009.

[14] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. Inform.
Theory, vol. IT-40, no. 4, pp. 1147-1157, July 1994.






